Propagation graph estimation from individuals’ time series of observed states

Various things propagate through the medium of individuals. Some individuals follow the others and take the states similar to their states a small number of time steps later. In this paper, we study the problem of estimating the state propagation order of individuals from the real-valued state sequences of all the individuals.We propose a method of constructing a state propagation graph from individuals’ time series of observed states. The propagation order estimated by our proposed method is demonstrated to be significantly more accurate than that by a baseline method (optimal constant delay model) for our synthetic datasets, and also to be consistent with visually recognizable propagation orders for the dataset of Japanese stock price time series and biological cell firing state sequences.


Propagation graph estimation from individuals' time series of observed states Tatsuya Hayashi * & Atsuyoshi Nakamura
Various things propagate through the medium of individuals. Some individuals follow the others and take the states similar to their states a small number of time steps later. In this paper, we study the problem of estimating the state propagation order of individuals from the real-valued state sequences of all the individuals.We propose a method of constructing a state propagation graph from individuals' time series of observed states. The propagation order estimated by our proposed method is demonstrated to be significantly more accurate than that by a baseline method (optimal constant delay model) for our synthetic datasets, and also to be consistent with visually recognizable propagation orders for the dataset of Japanese stock price time series and biological cell firing state sequences.
Sometimes, it is very important to analyze how things such as vibration, heat, cell firing, information, virus and etc, propagated. The objectives of such analyses are diverse from identification of the sources and the propagation routes to learning a propagation model for prediction. Physical propagation such as vibration and heat follows physical law. However, biological propagation such as cell firing has more ambiguous propagation rules, and propagation through the medium of human beings such as information and virus propagation is more complex.
The state propagation from one individual to another individual can be seen as a simple causal relationship between them. Granger causality 1 and transfer entropy 2 are well-known methods for investigating the causal relationship between time series, and their extensions and applications have been still energetically investigated [3][4][5] . In these methods, a parameterized stationary model is assumed and long time series are needed for its parameter estimation. Contrary to the fact that these methods can deal with various kinds of influence, the state propagation treats only the influence of taking similar states with some delay. By virtue of this simplicity, propagation relation estimation does not need such long time series.
In this study, we propose an alignment-based method of estimating state propagation relationship between a pair of individuals from their time series of observed states. There already has existed an extended Granger causality method into which a kind of alignment called dynamic time warping (DTW) is incorporated to deal with the arbitrary-time-lag influence between time series 6 . Different from this Granger causality extension, we estimate time delays of the propagations and use them to estimate direct and indirect propagations. Time delay estimation among signals 7,8 has been studied well in the context of source localization, however, only constant time delays are dealt with there. We treat variable time delays and estimate time delay sum.
From the estimated state propagation relationships between all the pairs of individuals, we construct an estimated state propagation graph whose edges are composed of the estimated direct propagations only. As for propagations through networks, various information or influence propagations have been studied: word-ofmouth marketing 9-12 , epidemics [13][14][15] , innovation diffusion 16,17 and so on. In most of these studies, networks are assumed to be given and not needed to be estimated though there are studies on propagation probability estimation through edges in a given network [18][19][20][21][22] . Studies on propagation through social networks are popular [23][24][25] , but in most social networks, relation between users are visible and not needed to be estimated. Recently, methods to reconstruct a complex network from binary time series have been developed 26,27 , but those methods require the sufficient length of binary time series because they use the maximum-likelihood estimation of the probabilities associated with presence or absence of links.
In our proposed method, for each pair of individuals (i, j), we calculate the time delay sum of individual j's states from individual i's matched states averaged over all the minimum cost alignments between their state time series. Then, propagation direction between i and j is estimated as i → j if such averaged time delay sum is positive, and as j → i if it is negative. From individual pairs (i, j) with non-zero average time delay sum, we construct OPEN Graduate School of Information Science and Technology, Hokkaido University, Sapporo 060-0814, Japan. * email: thayashi@ist.hokudai.ac.jp Note that, considering that V is fixed to I, a solution of the above problem is estimation Ê of the directed edge set E. Alignment-based direction estimation. Let s i and s j be the state time series of individuals i and j. From s i and s j , we estimate in which direction i → j or j → i the states propagated. As an estimation method, we propose a method based on the sum of delay times at matched positions in the minimum cost alignments between s i and s j . An alignment of two time series s i and s j is a pair of two same length sequences s ′ i and s ′ j which are made from s i and s j , respectively, by inserting some values at some positions in s i and s j so as to take similar values at the same positions. As an inserted value, two types of values are considered: a gap in gap-based alignment and the same value as the previous-position's value in DTW(dynamic time warping)-based alignment. For example, one of gap-based alignments between two binary-state time series s i = 001000100 and s j = 000100010 is and one of DTW-based alignments between two real-state time series s i = 2210022310 and s j = 1221022231 is There are various alignments between a pair of time series but only the minimum cost alignments are considered for a given cost function w : for the length T ′ of the aligned sequences s ′ i and s ′ j . As for a cost function, we use the absolute difference w(x, y) = |x − y| in a DTW-based alignment. In a gap-based alignment, we use a problem dependent cost function. For example, in the case that Y = {0, 1} and each 1-state in one sequence is strongly www.nature.com/scientificreports/ preferred to be aligned to 1-state in the other sequence by shifting positions unless their position difference is large ( 2 × (position difference) > α ) or the number of 1-states is different, the following cost function seems to be appropriate: For the cost function (Eq. 2) with α = 3 , the cost of the alignment (Eq. 1) is 2, which is a minimum cost alignment. There are 6 minimum cost alignments between the time series s i = 001000100 and s j = 000100010 for the cost function. Let M(s ′ i , s ′ j ) denote the set of matched position pairs in the alignment (s ′ i , s ′ j ) . For example, (3,4), (4,5), (5, 6), (6, 7), (7,8), (9,9)} . Define the time delay of a matched position pair (t ′ i , t ′ j ) by t ′ j − t ′ i , and consider the time delay sum of s ′ j from s ′ i calculated by For example, the time delay sum of s ′ j from s ′ i for the alignment (Eq. 1) is 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 = 7 . The time delay sums for the other 5 minimum cost alignments are 5, 6, 6, 7, 8, so the time delay sum averaged over all the 6 minimum cost alignments is 6.5.
Using the average time delay sum of the minimum cost alignments, we estimate the direction of state propagation between individuals i and j by the following rule (E).
(E) The propagation direction is estimated as i → j if the time delay sum of s ′ j from s ′ i averaged over the minimum cost alignments (s ′ i , s ′ j ) between s i and s j is positive, and j → i if that is negative.
Edge set estimation. By rule (E), directions are decided for all the individual pairs but those with zero average time delay sum. If we let the estimated edge set Ê be the set of all (i, j) ∈ I × I with non-zero average time delay sum, the following two issues arise: P1 Ê contains many edges with small average time delay sum, which connects pairs of synchronized individuals. P2 Ê contains (i, j) for which individual i's state not directly but indirectly affects individual j's state through the medium of some other individual k.
As a countermeasure for P2, that is, in order to delete indirectly affecting edges, we define a candidate edge as an edge with average time delay sum larger than threshold θ and sort all the candidate edges by average time delay sum in descending order and greedily delete edge (i, j) one by one for which an indirect path from i to j exists. Threshold θ should be set to the estimated maximum average time delay sum of directly affecting edges. In the distribution over average time delay sum between all the individual pairs, average time delay sum between directly affecting pairs is considered to form the highest peak with high probability. So, we set θ to the first valley position larger than the highest peak position in the distribution of the average time delay sum estimated by kernel density estimation. For P1, we try to partition V into layers by classifying the synchronized individuals to the same layer, and then delete all the edges between vertices in the same layer. For a given graph G(V, E), define the 0-layer set V E 0 as the set of vertices with indegree 0. If there is no vertex with indegree 0, define V E 0 as the set of vertices for which the maximum average time delay sum among all the incoming edges is the smallest among those for all the vertices. Define the i-layer set V E i recursively as the set of vertices that do not belong to the j-layer set V E j for any j = 0, 1, . . . , i − 1 but have an incoming edge from some vertex in the (i − 1)-layer set V E i−1 . Given a graph G(V ,Ê) with V = I and the set Ê of directed edges e whose direction is estimated by its average time delay sum AD(e) , and threshold θ , the whole process of edge set estimation is described as follows.
1. e 1 , . . . , e m ← sorted list of edges e ∈Ê with AD(e) > θ in descending order of AD(e). 2. For e = e 1 , . . . , e m , remove the edges e = (i, j) ∈Ê if there exists an indirect path from i to j. 3. Set VÊ 0 to the set of vertices in V whose indegree is 0. 4. Set i to 1. Repeat the followings until V \ i−1 j=0 VÊ j = ∅ : set VÊ i to the set of vertices in V \ i−1 j=0 VÊ j that has an incoming edge from a vertex in VÊ i−1 , and then increase i by 1. 5. Remove all the edges (i, j) ∈Ê whose end points i, j belong to the same layer VÊ k for some k ∈ [N].
Example. Figure 1 is the summary diagram of our method with a toy example. In the example, state time series s 1 , . . . s 5 for five individuals 1, . . . , 5 are assumed to be observed and the average time delay sum of every pair is calculated. (The number written on each edge indicates its average time delay sum.) Threshold θ is set to 15.6 because it is the first valley position larger than the highest peak position around 10 in the distribution of the average time delay sum estimated from the set of the average time delay sums {1, 2, 9, 10, 10, 10, 11, 11, 20, 21} using kernel density estimation. In this case, the average time delay sum 20 and 21 of edges (1,3) and (1,4), respectively, are more than θ , so in descending order of average time delay sum, first, edge (1, 3) is checked if there exists an indirect path from vertex 1 to vertex 3 and removed because there exists, then, edge (1, 4) is checked similarly and removed. Finally, the five vertices are divided into three layers by their path lengths from vertex 1, which is the vertex with indegree 0, and the estimated edge set Ê is made by removing all the edges between the same layer's vertices. In the last procedure, edges (2, 5) and (4, 3) are removed in our example.

Numerical simulations and application to real-world datasets
In this section, we experimentally show effectiveness of our method using synthetic and real world datasets. The gap-based cost w defined by Eq. (2) with α = 3 is used by the proposed method using gap-based cost in all the experiments for binary state propagation. Generating graphs and plots in our experiments was executed in Mathematica 28 .
Experiments using synthetic datasets. First, we evaluate how accurate the estimated edge set Ê by the proposed method is for the real-valued and binary state sequence dataset generated from a delay model with a given ground truth propagation graph G(V, E).

Ground truth graphs and datasets. [Real-valued State Propagation]
We generate the dataset using ground truth propagation graph G(V, E) shown in Fig. 2a. The length-100 time series s i [1] · · · s i [100] for each vertex (individual) i = 1, . . . , 10 are generated by the following steps. Note that in(i) denotes the set of vertices from which edges come to vertex i.
[Binary state propagation] The dataset is generated by propagation model in which individuals are located in 2-dimensional real space and state-1 of individual j is propagated from individuals i within some distance, then the ground truth graph G(V, E) is generated from the dataset and individuals' location information. Note that the proposed method estimates E without individuals' location information. Given a parameter 0 < p ≤ 1 of the state-1 propagation probability, the length-200 time series s i [1] · · · s i [200] for each vertex (individual) i = 1, . . . , 50 is generated as follows.
Step 1 For ε ←random value generated according to N(0, 1) where | · | denotes the number of elements in set ' · ' . Then, E is defined as A ground truth graph G(G, E) for one dataset with p = 0.95 is shown in Fig. 3a.
Using directed edge set E of the ground truth propagation graph, we evaluate an estimated directed edge set Ê in terms of precision (Prec), recall (Rec) and F-measure (FM) defined as Note that our method cannot rank the edges, so evaluation using precision-recall or ROC curve is difficult. How to balance precision and recall depending on applications is one of our future research issues.
In our setting, time series of vertices in the same layer are similar to each other even if their incoming edges are different. In that sense, it is impossible to correctly guess incoming edges, that is, from which vertices in the previous layer the states were propagated directly. Thus, we also evaluate Ê in terms of looser measures. We can also consider layer partition VÊ 0 , VÊ 1 , · · · for G(V ,Ê) like layer partition V E 0 , V E 1 , · · · that is defined in the section titled "Edge Set Estimation" for the ground truth propagation graph G(V, E). Then, we define layer accuracy (LA) and Mean layer difference (MLD) of Ê as where % is modulus operator. If there are multiple candidates for D i,j , we adopt D i,j with the smallest absolute value. Using D i,j , propagation direction is estimated as i → j if D i,j > 0 and j → i if D i,j < 0 . We construct estimated edge set Ê of the baseline method by applying the procedure proposed in the section titled "Edge Set Estimation" using D i,j instead of the average time delay sum of s j from s i .

Parameters of kernel density estimators.
In all the experiments, we use Gaussian kernel in the kernel density estimation. The results of all the simulations were almost the same for other kernels: Biweight, Cosine, Epanechnikov and Triangular. As for the bandwidth h, we use the following rule of thumb: where β is a positive constant, σ is the standard deviation of the samples, Q 1 and Q 3 are the lower and upper quartiles, respectively, and n is the sample size. Constant β is set to 0.9 in Silverman's rule of thumb 30 . In the experiments using synthetic datasets, β is set to the value with the minimum MLD that is found by a grid search in {0.01, 0.02, 0.03, . . . , 1.99, 2.00}. Table 1.

Results. [Real-valued state propagation] Performance comparison with the baseline method by the evaluation measures is shown in
Note that the values in the table are averaged over 100 datasets and the parenthesized values are their 95% confidence intervals. Our method significantly outperforms the baseline method in all the measures for (τ 1 , τ 2 ) = (0, 1) and have comparable performance to it for (τ 1 , τ 2 ) = (1, 2) . The reason for performance degrade of the baseline method in the case with (τ 1 , τ 2 ) = (0, 1) is guessed to be its coarse estimation; it can distinguish one-layer (direct) and two-layer (indirect) propagation differences for (τ 1 , τ 2 ) = (1, 2) because their expected average delay times are 1.5 and 3 whose nearest integer sets are {1, 2} and {3} , respectively, so no intersection exists between them, but for (τ 1 , τ 2 ) = (0, 1) , it cannot distinguish their differences because their expected average delay times are 0.5 and 1 whose nearest integer sets are {0, 1} and {1} , respectively, so 1 is a common value. The estimation of our proposed method is fine enough for distinguishing such differences.
The estimated propagation graph G(V ,Ê) by the proposed method for one of the synthetic datasets is shown in Fig. 2c. Parameter θ is set to 180.7 from the estimated distribution (Fig. 2b). For this dataset, there are some falsely detected edges but all the edges in E are correctly detected and all the falsely detected edges keep the correct layer structure. (Fig. 2c). The frequencies of the best grid values β for the bandwidth of kernel density estimation are shown in Fig. 2d, which says that β = 0.15 ∼ 0.30 are appropriate for these datasets.
[Binary state propagation] Performance comparison with the baseline method by our evaluation measures is shown in Table 2. The proposed method outperformed the baseline method in all the measures except precision and for all the p values except 0.5. Precisions of both the methods are low compared to their recalls, that is due to correct edge (directly affecting edge) definition: location information is used to define the ground truth graph edges but such information is not used in this experimental setting. Our method successfully estimates each individual's belonging layer with high LA and low MLD when p is around 1 and keeps LA about 0.8 even for p = 0.6.
The estimated graph G(V ,Ê) by the proposed method for one of the datasets with p = 0.95 is shown in Fig. 3c. For the dataset, parameter θ is set to 287.2 from the estimated distribution (Fig. 3b). There are many falsely detected edges but all the edges in E are correctly detected and all the falsely detected edges keep the correct layer structure. The frequencies of the best grid values β for the bandwidth of kernel density estimation are shown in Fig. 3d, which says that β = 0.2 ∼ 0.9 are appropriate for these datasets. Table 1. Estimation performance of the baseline method and our proposed method using warping-based cost averaged over 100 datasets. We used Gaussian kernel with a bandwidth with the best β. www.nature.com/scientificreports/ Application to real-world datasets. For real-world datasets, there is no ground truth graph so the measures adopted for synthetic datasets cannot be used to evaluate performance. Thus, only what we can do is to visually check the consistency of the estimated propagation graphs with given datasets. As for parameters of kernel density estimation, we use Gaussian kernel and set the bandwidth-related parameter β to 0.25 because β = 0.2 ∼ 0.3 are appropriate values for both the real-valued and binary state propagations in the experiments using synthetic datasets.
Stock price analysis. We report our analysis of stock price propagation by the proposed method. Stock price fluctuates greatly and its propagation is ambiguous and the propagation direction often changes. In that sense, it does not seem to satisfy Assumption 1, but our method can be used to extract a trend such as which stock price tends to follow which stock price during the period in total. Here, we show the result of such trend analysis using opening stock price time series for one year period. We  Figure 4b shows the estimated propagation graph with the vertices of 17 sectors by the proposed method for threshold θ = 26.6 , which is determined from estimated distribution of average time delay sum (Fig. 4a). Figure 4c shows the minimum cost path between the time series s 9 and s 17 in the dynamic programming table for calculating the minimum cost, which is composed of the optimally matched positions between the two time series. The horizontal and vertical axes are positions of s 9 and s 17 , respectively, and the points above the diagonal line (black points) mean that s 9 is delayed from s 17 at those positions and the points below the diagonal line (light blue points) means that s 17 is delayed from s 9 at those positions. The average time delay sum of s 9 from s 17 is 22.0, which means that s 9 tends to follow s 17 in total. In fact, comparing to the diagonal line, there are more above points than the below points. Figure 4d shows the line graph of time series s ′ 9 and s ′ 17 with gray and light blue lines connecting their corresponding matched positions in the alignment. You can see that s 9 (derivative of s ′ 9 ) follows s 17 during two long time periods [59,77] and [193,208] with small time delays.
Among the set of pairs of individual stocks, stock pairs that have clearer leader-follower relationship can be found. Figure 4e shows the standardized sequences of the opening price for one of those pairs ("NAGAWA", "KYOKUTO BOEKI KAISHA") with the lines connecting corresponding points between them. In the figure, you can see that black stock (NAGAWA) follows blue stock (KYOKUTO BOEKI KAISHA) with large time delay during period between 60 and 190.
Cell's firing analysis. We applied our method to estimating firing state propagation order of biological cells. The dataset is composed of 250-frame {0, 1}-state and 2D-location sequences of 172 cells, where states 1 and 0 represent firing and not firing, respectively. Our method uses state sequences alone and location sequences are used only for result visualization.
We used the data of 144 cells except for 28 cells which could not be measured properly due to noise. From the set of 144 binary sequences with length 250, we extracted 4 datasets I 1 , I 2 , I 3 and I 4 , each of which is composed Table 2. Estimation performance of the baseline method and our proposed method using gap-based cost averaged over 100 datasets for 7 values of parameter p:0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 1.00. We used Gaussian kernel with a bandwidth with the best β. www.nature.com/scientificreports/ of 144 length-100 consecutive subsequences starting at frame t = 1, 51, 101 and 151, respectively, of the original length-250 sequences. The layer partitions of the estimated graphs by the proposed method for thresholds θ = 67.1(I 1 ), 52.9(I 2 ), 11.4(I 3 ), 10.5(I 4 ) are shown in Fig. 5a, where θ s are determined from estimated distributions of average time delay sum (Fig. 5b). For datasets I 2 , I 3 and I 4 , the first layer's cells look located around the lower right and the last layer's cells look located around the upper left, and the locational direction of layer sequence VÊ 0 , VÊ 1 , · · · looks from the lower right to the upper left. Figure 6a shows {0, 1}-state sequences in dataset I 4 . We can see that cells with similar sequences are classified into the same layer. Appropriateness of the estimated layer order can be also confirmed by Layer-consensus state sequences shown in Fig. 6b.

Concluding remarks
We proposed the way of constructing a state propagation graph that visualizes the estimated state propagation order of individuals. According to our experiments using real-valued and symbolic time series synthetic datasets generated by stochastic delay models, the edge sets of propagation graphs estimated by our method achieved comparable or higher F-measure and layer accuracy than those by a baseline method (optimal constant delay a b c d e www.nature.com/scientificreports/ model), where layer accuracy is the accuracy on the number of steps to be taken in propagation from the source individuals to each individual. In order to demonstrate practical usefulness of our method, we applied our method to propagation analyses of stock price and biological cell firing. For both datasets, the propagation order estimated by our proposed method is shown to be consistent with visually recognizable propagation order. The propagation delay is not stable for stock price propagation, but which stocks tended to follow which stocks in a given period is interesting information and automatic visualization may be useful to investors. Our method is considered to be useful for analyses of such unstable propagation.